# ArxivRetriever

The `arXiv Retriever` allows users to query the arXiv database for academic articles. It supports both full-document retrieval (PDF parsing) and summary-based retrieval.

For detailed documentation of all ArxivRetriever features and configurations, head to the [API reference](https://api.js.langchain.com/classes/_langchain_community.retrievers_arxiv.ArxivRetriever.html)

## Features

- Query Flexibility: Search using natural language queries or specific arXiv IDs.
- Full-Document Retrieval: Option to fetch and parse PDFs.
- Summaries as Documents: Retrieve summaries for faster results.
- Customizable Options: Configure maximum results and output format.

## Integration details

| Retriever        | Source                       | Package                                                                      |
| ---------------- | ---------------------------- | ---------------------------------------------------------------------------- |
| `ArxivRetriever` | Academic articles from arXiv | [`@langchain/community`](https://www.npmjs.com/package/@langchain/community) |

## Setup

Ensure the following dependencies are installed:

- `pdf-parse` for parsing PDFs
- `fast-xml-parser` for parsing XML responses from the arXiv API

```npm2yarn
npm install pdf-parse fast-xml-parser
```

## Instantiation

```typescript
const retriever = new ArxivRetriever({
  getFullDocuments: false, // Set to true to fetch full documents (PDFs)
  maxSearchResults: 5, // Maximum number of results to retrieve
});
```

## Usage

Use the `invoke` method to search arXiv for relevant articles. You can use either natural language queries or specific arXiv IDs.

```typescript
const query = "quantum computing";

const documents = await retriever.invoke(query);
documents.forEach((doc) => {
  console.log("Title:", doc.metadata.title);
  console.log("Content:", doc.pageContent); // Parsed PDF content
});
```

## Use within a chain

Like other retrievers, `ArxivRetriever` can be incorporated into LLM applications via chains. Below is an example of using the retriever within a chain:

```typescript
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import {
  RunnablePassthrough,
  RunnableSequence,
} from "@langchain/core/runnables";
import { StringOutputParser } from "@langchain/core/output_parsers";
import type { Document } from "@langchain/core/documents";

const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0,
});

const prompt = ChatPromptTemplate.fromTemplate(`
Answer the question based only on the context provided.

Context: {context}

Question: {question}`);

const formatDocs = (docs: Document[]) => {
  return docs.map((doc) => doc.pageContent).join("\n\n");
};

const ragChain = RunnableSequence.from([
  {
    context: retriever.pipe(formatDocs),
    question: new RunnablePassthrough(),
  },
  prompt,
  llm,
  new StringOutputParser(),
]);

await ragChain.invoke("What are the latest advances in quantum computing?");
```

## API reference

For detailed documentation of all ArxivRetriever features and configurations, head to the [API reference](https://api.js.langchain.com/classes/_langchain_community.retrievers_arxiv.ArxivRetriever.html)

## Related

- Retriever [conceptual guide](/docs/concepts/#retrievers)
- Retriever [how-to guides](/docs/how_to/#retrievers)
